predictor space
Beyond Random Split for Assessing Statistical Model Performance
Catania, Carlos, Guerra, Jorge, Romero, Juan Manuel, Caffaratti, Gabriel, Marchetta, Martin
Even though a train/test split of the dataset randomly performed is a common practice, could not always be the best approach for estimating performance generalization under some scenarios. The fact is that the usual machine learning methodology can sometimes overestimate the generalization error when a dataset is not representative or when rare and elusive examples are a fundamental aspect of the detection problem. In the present work, we analyze strategies based on the predictors' variability to split in training and testing sets. Such strategies aim at guaranteeing the inclusion of rare or unusual examples with a minimal loss of the population's representativeness and provide a more accurate estimation about the generalization error when the dataset is not representative. Two baseline classifiers based on decision trees were used for testing the four splitting strategies considered. Both classifiers were applied on CTU19 a low-representative dataset for a network security detection problem. Preliminary results showed the importance of applying the three alternative strategies to the Monte Carlo splitting strategy in order to get a more accurate error estimation on different but feasible scenarios.
Implementing a Decision Tree From Scratch
Tree-based methods are simple and useful for interpretation since the underlying mechanisms are considered quite similar to human decision-making. The methods involve stratifying or segmenting the predictor space into a number of simpler regions. When making a prediction, we simply use the mean or mode of the region the new observation belongs to as a response value. Since the splitting rules to segment the predictor space can be best described by a tree-based structure, the supervised learning algorithm is called a Decision Tree. Decision trees can be used for both regression and classification tasks.
On Margin Maximization in Linear and ReLU Networks
Vardi, Gal, Shamir, Ohad, Srebro, Nathan
The implicit bias of neural networks has been extensively studied in recent years. Lyu and Li [2019] showed that in homogeneous networks trained with the exponential or the logistic loss, gradient flow converges to a KKT point of the max margin problem in the parameter space. However, that leaves open the question of whether this point will generally be an actual optimum of the max margin problem. In this paper, we study this question in detail, for several neural network architectures involving linear and ReLU activations. Perhaps surprisingly, we show that in many cases, the KKT point is not even a local optimum of the max margin problem. On the flip side, we identify multiple settings where a local or global optimum can be guaranteed. Finally, we answer a question posed in Lyu and Li [2019] by showing that for non-homogeneous networks, the normalized margin may strictly decrease over time.
Beyond Average Performance -- exploring regions of deviating performance for black box classification models
Torgo, Luis, Azevedo, Paulo, Areosa, Ines
Machine learning models are becoming increasingly popular in different types of settings. This is mainly caused by their ability to achieve a level of predictive performance that is hard to match by human experts in this new era of big data. With this usage growth comes an increase of the requirements for accountability and understanding of the models' predictions. However, the degree of sophistication of the most successful models (e.g. ensembles, deep learning) is becoming a large obstacle to this endeavour as these models are essentially black boxes. In this paper we describe two general approaches that can be used to provide interpretable descriptions of the expected performance of any black box classification model. These approaches are of high practical relevance as they provide means to uncover and describe in an interpretable way situations where the models are expected to have a performance that deviates significantly from their average behaviour. This may be of critical relevance for applications where costly decisions are driven by the predictions of the models, as it can be used to warn end users against the usage of the models in some specific cases.
Top 5 Statistical Data Analysis Techniques a Data Scientist Should Know
Statistical data analysis is a procedure of performing various statistical operations. It is a kind of quantitative research, which seeks to quantify the data, and typically, applies some form of statistical analysis. Quantitative data involves descriptive data, such as survey data and observational data. Statistical data analysis generally involves some form of statistical tools, which a layman cannot perform without having any statistical knowledge. Linear Regression, is the technique that is used to predict a target variable by providing the best linear relationship among the dependent and independent variables where best fit indicates the sum of all the distances amidst the shape and actual observations at each data point is as minimum as achievable.
Implicit Variational Inference: the Parameter and the Predictor Space
Pequignot, Yann, Alain, Mathieu, Dallaire, Patrick, Yeganehparast, Alireza, Germain, Pascal, Desharnais, Josée, Laviolette, François
Having access to accurate confidence levels along with the predictions allows to determine whether making a decision is worth the risk. Under the Bayesian paradigm, the posterior distribution over parameters is used to capture model uncertainty, a valuable information that can be translated into predictive uncertainty. However, computing the posterior distribution for high capacity predictors, such as neural networks, is generally intractable, making approximate methods such as variational inference a promising alternative. While most methods perform inference in the space of parameters, we explore the benefits of carrying inference directly in the space of predictors. Relying on a family of distributions given by a deep generative neural network, we present two ways of carrying variational inference: one in \emph{parameter space}, one in \emph{predictor space}. Importantly, the latter requires us to choose a distribution of inputs, therefore allowing us at the same time to explicitly address the question of \emph{out-of-distribution} uncertainty. We explore from various perspectives the implications of working in the predictor space induced by neural networks as opposed to the parameter space, focusing mainly on the quality of uncertainty estimation for data lying outside of the training distribution. We compare posterior approximations obtained with these two methods to several standard methods and present results showing that variational approximations learned in the predictor space distinguish themselves positively from those trained in the parameter space.
Inside: Logistic Regression
This is a part of a series of blogs where I'll be demonstrating different aspects and the theory of Machine Learning Algorithms by using math and code. This includes the usual modeling structure of the algorithm and the intuition on why and how it works, using Python code. Logistic Regression is one of the first algorithms that is introduced when someone learns about classification. You probably would have read about Regression and the continuous nature of the predictor variable. Classification is done on discrete variables, which means your predictions are finite and class-based like a Yes/No, True/False for binary outcomes.
Predicting into unknown space? Estimating the area of applicability of spatial prediction models
Predictive modelling using machine learning has become very popular for spatial mapping of the environment. Models are often applied to make predictions far beyond sampling locations where new geographic locations might considerably differ from the training data in their environmental properties. However, areas in the predictor space without support of training data are problematic. Since the model has no knowledge about these environments, predictions have to be considered uncertain. Estimating the area to which a prediction model can be reliably applied is required. Here, we suggest a methodology that delineates the "area of applicability" (AOA) that we define as the area, for which the cross-validation error of the model applies. We first propose a "dissimilarity index" (DI) that is based on the minimum distance to the training data in the predictor space, with predictors being weighted by their respective importance in the model. The AOA is then derived by applying a threshold based on the DI of the training data where the DI is calculated with respect to the cross-validation strategy used for model training. We test for the ideal threshold by using simulated data and compare the prediction error within the AOA with the cross-validation error of the model. We illustrate the approach using a simulated case study. Our simulation study suggests a threshold on DI to define the AOA at the .95 quantile of the DI in the training data. Using this threshold, the prediction error within the AOA is comparable to the cross-validation RMSE of the model, while the cross-validation error does not apply outside the AOA. This applies to models being trained with randomly distributed training data, as well as when training data are clustered in space and where spatial cross-validation is applied. We suggest to report the AOA alongside predictions, complementary to validation measures.